1 - Introduction


In this analysis, we will analyze Formula 1 data from the year 2,000 and onward. The reason for this time frame is because of two things: the first begin that there no active drivers that began their F1 careers before the year 2,000, and the second being that, beginning in the year 2,000, it was mandated that all racecars have the same engine size, which would change overtime.



Furthermore, our analysis dives into the relationship between a driver’s starting position and the likelihood of obtaining a certain finishing result. One would assume that the better one qualifies, the higher the change of winning the race. Also, it is known that there are a few big teams/drivers who have dominated the sport over the years and continue to do so, thus providing relatively consistent results. One can easily assume that certain teams are much more likely to win a Grand Prix thus those teams are more statistically significant in determining the result of a race. Because of this, we also investigate the relationship between the results/points given the constructors (teams) and drivers.

Lastly, we create several prediction models that allow for predicting a race results. We use ONLY the parameters that one would have available just before a race: the team name, the driver name, the era (engine size) and the starting position.


2 - Packages


knitr::opts_chunk$set(echo = TRUE)

library(dplyr)
library(ggplot2)
library(caret)
library(class)


3 - EDA


Starting Position Vs Race Result
Unfortunately, due to lack of programming knowledge, we were unable to create the plot that we wanted. However, we found exactly what we needed online and thus decided to include that plot with the source tagged just below it.

From the plot, one can see that there does in-fact appear to be a relationship between a driver’s starting position. It is clear that the higher one starts on the grid, the higher the probability of winning the race. In addition these findings, we also noted that the lower one starts on the grid, the higher the probability of not finishing the race (DFN), which could make sense if one assumes that drivers towards the back are driving much less conservatively and thus more aggressive than those towards the front.

Result Probability By Qualifying Position Source: MrBookman_LibraryCop”(Reddit User)
Click Here to Visit Source


Constructor/Driver Vs Race Result


4 - Modeling Data


The data includes all race results since the year 2,000 and the following attributes:
  • Drive name/id // currently only the 25 drivers with the most race results have a unique name. All other drivers have there names labeled ‘Other’.
  • Constructor name/id // currently only the 25 constructors with the most reace results have a unique name. All other constructors have there names labeled ‘Other’.
  • ace name/id.
  • Starting position labeled ‘grid’.
  • inishing position labeled ‘positionOrder’.
  • The altitude of a track.
  • The era of engine size.
F1DataFrame = read.csv("ModelingData/F1ModelingCustomCategoricalVariables.csv")

head(F1DataFrame)
##   X driverId constructorId circuitId raceId year                name.x resultId
## 1 1        1             1         3      4 2009    Bahrain Grand Prix     7617
## 2 2        1           131        32    943 2015    Mexican Grand Prix    22858
## 3 3        1           131        14    892 2013    Italian Grand Prix    21962
## 4 4        1           131        35    894 2013     Korean Grand Prix    22002
## 5 5        1           131        73    992 2018 Azerbaijan Grand Prix    23842
## 6 6        1           131         4    884 2013    Spanish Grand Prix    21811
##   grid positionOrder                         name.y  alt     name forename
## 1    5             4  Bahrain International Circuit    7  McLaren    Lewis
## 2    2             2   Autódromo Hermanos Rodríguez 2227 Mercedes    Lewis
## 3   12             9   Autodromo Nazionale di Monza  162 Mercedes    Lewis
## 4    2             5   Korean International Circuit    0 Mercedes    Lewis
## 5    2             1              Baku City Circuit   -7 Mercedes    Lewis
## 6    2            12 Circuit de Barcelona-Catalunya  109 Mercedes    Lewis
##    surname       fullname                  era
## 1 Hamilton Lewis Hamilton        90 degrees V8
## 2 Hamilton Lewis Hamilton 90 degrees V6 + MGUs
## 3 Hamilton Lewis Hamilton        90 degrees V8
## 4 Hamilton Lewis Hamilton        90 degrees V8
## 5 Hamilton Lewis Hamilton 90 degrees V6 + MGUs
## 6 Hamilton Lewis Hamilton        90 degrees V8


Logic to split data into training and testing sets.

modelVariables = c('positionOrder', 'grid', 'name', 'fullname', 'era')
modelDataframe = F1DataFrame[, modelVariables]

set.seed(4)

trainingIndices = sample(c(1:dim(modelDataframe)[1]), dim(modelDataframe)[1]*0.8)

trainingDataframe = modelDataframe[trainingIndices,]
testingDataframe = modelDataframe[-trainingIndices,]


5 - Model 1 (Liner Regression)


The Model
mymodel<-lm(positionOrder ~ grid + name + fullname + era, data = F1DataFrame)

Residuals Vs Fitted Plot
  • Shows evidence of a negative slope with variance that seems to increase as fitted values increase.
  • This means the model is on average predicting worse results for drivers who finish with a good results.
  • As the fitted values increase it seems to get a wider difference in predicted vs actual results.
Q-Q Plot
  • Suggests a normal distribution can be assumed
Scale-Location Plot
  • Provides more evidence for normality.
  • There seems to be a patterns in starting at each discrete fitted value and traveling parallel to the Y axis with a sin function but with sharp 90 degree transitions.
  • Currently we don’t have an explanation for why this is.
mymodel<-lm(positionOrder ~ grid + name + fullname + era, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)


Model Summary Statistics

  • The model doesn’t not perform well With R-squared values at 0.27.
  • Surprisingly starting position doesn’t appear to be statistically significant which doesn’t make sense. One would expect there to be a strong correlation between starting position and finishing position.
summary(mymodel)
## 
## Call:
## lm(formula = positionOrder ~ grid + name + fullname + era, data = F1DataFrame)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.006  -3.505  -1.091   2.737  19.235 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  10.38543    0.66271  15.671  < 2e-16 ***
## grid                          0.35069    0.01151  30.466  < 2e-16 ***
## nameAlphaTauri               -1.00718    0.60703  -1.659 0.097110 .  
## nameAlpine F1 Team           -2.01517    0.67899  -2.968 0.003006 ** 
## nameArrows                    1.15489    0.69475   1.662 0.096486 .  
## nameBAR                      -0.29271    0.57947  -0.505 0.613477    
## nameBMW Sauber               -3.29514    0.63922  -5.155 2.59e-07 ***
## nameCaterham                  1.37396    0.63183   2.175 0.029687 *  
## nameFerrari                  -3.44972    0.45384  -7.601 3.23e-14 ***
## nameForce India              -1.61483    0.51394  -3.142 0.001683 ** 
## nameHaas F1 Team              0.35787    0.56938   0.629 0.529670    
## nameHonda                    -0.41443    0.68954  -0.601 0.547838    
## nameHRT                       2.93411    0.64075   4.579 4.73e-06 ***
## nameJaguar                    0.44880    0.60081   0.747 0.455086    
## nameJordan                    0.84923    0.57738   1.471 0.141369    
## nameLotus F1                 -1.71388    0.61496  -2.787 0.005331 ** 
## nameMarussia                  1.35399    0.63763   2.123 0.033741 *  
## nameMcLaren                  -1.58805    0.45032  -3.526 0.000423 ***
## nameMercedes                 -2.99824    0.51277  -5.847 5.17e-09 ***
## nameMinardi                   0.40154    0.58324   0.688 0.491174    
## nameOther                     0.33400    0.46587   0.717 0.473434    
## nameRed Bull                 -2.80957    0.49236  -5.706 1.19e-08 ***
## nameRenault                  -1.48916    0.47895  -3.109 0.001882 ** 
## nameSauber                   -0.34052    0.47023  -0.724 0.468988    
## nameToro Rosso               -0.25803    0.47321  -0.545 0.585583    
## nameToyota                   -0.93973    0.54566  -1.722 0.085074 .  
## nameWilliams                 -0.14091    0.44700  -0.315 0.752583    
## fullnameCarlos Sainz         -2.07692    0.64778  -3.206 0.001350 ** 
## fullnameDaniel Ricciardo     -2.23405    0.61061  -3.659 0.000255 ***
## fullnameDavid Coulthard      -1.06185    0.66215  -1.604 0.108830    
## fullnameFelipe Massa         -2.27072    0.59020  -3.847 0.000120 ***
## fullnameFernando Alonso      -2.73761    0.57253  -4.782 1.77e-06 ***
## fullnameGiancarlo Fisichella -2.04084    0.60956  -3.348 0.000817 ***
## fullnameJarno Trulli         -0.58987    0.62167  -0.949 0.342725    
## fullnameJenson Button        -2.98734    0.58761  -5.084 3.77e-07 ***
## fullnameKevin Magnussen      -1.41945    0.71095  -1.997 0.045903 *  
## fullnameKimi Räikkönen       -2.19286    0.58081  -3.776 0.000161 ***
## fullnameLance Stroll         -3.01532    0.68531  -4.400 1.10e-05 ***
## fullnameLewis Hamilton       -4.51906    0.62100  -7.277 3.69e-13 ***
## fullnameMark Webber          -1.60402    0.61988  -2.588 0.009679 ** 
## fullnameMax Verstappen       -3.29763    0.67780  -4.865 1.16e-06 ***
## fullnameMichael Schumacher   -1.67495    0.65920  -2.541 0.011074 *  
## fullnameNick Heidfeld        -1.71395    0.64194  -2.670 0.007599 ** 
## fullnameNico Hülkenberg      -1.19213    0.58740  -2.029 0.042437 *  
## fullnameNico Rosberg         -2.62344    0.65277  -4.019 5.89e-05 ***
## fullnameOther                -1.13431    0.49616  -2.286 0.022266 *  
## fullnameRalf Schumacher      -2.16772    0.68905  -3.146 0.001661 ** 
## fullnameRomain Grosjean      -0.79773    0.70348  -1.134 0.256832    
## fullnameRubens Barrichello   -1.96696    0.63711  -3.087 0.002026 ** 
## fullnameSebastian Vettel     -3.13173    0.58791  -5.327 1.02e-07 ***
## fullnameSergio Pérez         -2.86729    0.56115  -5.110 3.29e-07 ***
## fullnameValtteri Bottas      -3.52762    0.64298  -5.486 4.21e-08 ***
## era90 degrees V8              0.13583    0.15134   0.897 0.369477    
## eraV10                       -1.21093    0.21407  -5.657 1.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.943 on 9106 degrees of freedom
## Multiple R-squared:  0.3659, Adjusted R-squared:  0.3622 
## F-statistic: 99.12 on 53 and 9106 DF,  p-value: < 2.2e-16


Investigation
  • The grouping of low starting positions and high finishing positions during the era of up to 12 cylinders explains the irregularities from normality in the models residual plots and lack of statistical significance in the model for stating position.
mymodel<-lm(positionOrder ~ grid, data = F1DataFrame)

ggplot(data = F1DataFrame, aes(x = grid, y = positionOrder, colour = era)) + 
  geom_point(position = "jitter", size = .8) +
  labs(title="starting vs finishing position by era", 
     y="finishing position", 
     x="starting position")



  • Plots look similar to the full model as expected.
mymodel<-lm(positionOrder ~ grid, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)


6 - Model 2 (Liner Regression)


  • The residual plots look similar to the first model.
mymodel<-lm(positionOrder ~ grid + name + fullname + era + grid:era, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)


Model 2 Summary Statistics
  • The model performs significantly better With R-squared values going from 0.27 to 0.32.
summary(mymodel)
## 
## Call:
## lm(formula = positionOrder ~ grid + name + fullname + era + grid:era, 
##     data = F1DataFrame)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.305  -3.527  -1.062   2.713  19.492 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  10.18992    0.68222  14.936  < 2e-16 ***
## grid                          0.35685    0.01675  21.306  < 2e-16 ***
## nameAlphaTauri               -0.96330    0.60783  -1.585 0.113043    
## nameAlpine F1 Team           -2.01214    0.68050  -2.957 0.003116 ** 
## nameArrows                    1.68047    0.70947   2.369 0.017875 *  
## nameBAR                      -0.18805    0.58039  -0.324 0.745948    
## nameBMW Sauber               -3.24381    0.63884  -5.078 3.90e-07 ***
## nameCaterham                  1.19126    0.63429   1.878 0.060402 .  
## nameFerrari                  -3.45016    0.45662  -7.556 4.56e-14 ***
## nameForce India              -1.61940    0.51491  -3.145 0.001666 ** 
## nameHaas F1 Team              0.37112    0.56909   0.652 0.514330    
## nameHonda                    -0.41850    0.69192  -0.605 0.545302    
## nameHRT                       2.58153    0.65514   3.940 8.20e-05 ***
## nameJaguar                    0.77193    0.60813   1.269 0.204344    
## nameJordan                    1.19768    0.58538   2.046 0.040788 *  
## nameLotus F1                 -1.64706    0.61482  -2.679 0.007399 ** 
## nameMarussia                  1.13948    0.64156   1.776 0.075747 .  
## nameMcLaren                  -1.58150    0.45005  -3.514 0.000443 ***
## nameMercedes                 -2.96328    0.52171  -5.680 1.39e-08 ***
## nameMinardi                   1.17922    0.62166   1.897 0.057874 .  
## nameOther                     0.32732    0.46878   0.698 0.485051    
## nameRed Bull                 -2.67785    0.49441  -5.416 6.24e-08 ***
## nameRenault                  -1.48690    0.47883  -3.105 0.001907 ** 
## nameSauber                   -0.28114    0.47039  -0.598 0.550072    
## nameToro Rosso               -0.30402    0.47383  -0.642 0.521137    
## nameToyota                   -0.71284    0.54921  -1.298 0.194339    
## nameWilliams                 -0.19097    0.44755  -0.427 0.669601    
## fullnameCarlos Sainz         -1.92574    0.64850  -2.970 0.002990 ** 
## fullnameDaniel Ricciardo     -2.15690    0.61099  -3.530 0.000417 ***
## fullnameDavid Coulthard      -1.13276    0.66200  -1.711 0.087092 .  
## fullnameFelipe Massa         -2.00781    0.59393  -3.381 0.000726 ***
## fullnameFernando Alonso      -2.56675    0.57460  -4.467 8.03e-06 ***
## fullnameGiancarlo Fisichella -1.91949    0.60986  -3.147 0.001653 ** 
## fullnameJarno Trulli         -0.71872    0.62250  -1.155 0.248299    
## fullnameJenson Button        -2.81085    0.58919  -4.771 1.86e-06 ***
## fullnameKevin Magnussen      -1.31376    0.71091  -1.848 0.064634 .  
## fullnameKimi Räikkönen       -2.05271    0.58146  -3.530 0.000417 ***
## fullnameLance Stroll         -2.88345    0.68776  -4.192 2.79e-05 ***
## fullnameLewis Hamilton       -4.29182    0.62320  -6.887 6.09e-12 ***
## fullnameMark Webber          -1.49148    0.62177  -2.399 0.016470 *  
## fullnameMax Verstappen       -3.24201    0.67972  -4.770 1.87e-06 ***
## fullnameMichael Schumacher   -1.76263    0.66470  -2.652 0.008021 ** 
## fullnameNick Heidfeld        -1.50224    0.64358  -2.334 0.019606 *  
## fullnameNico Hülkenberg      -1.05863    0.58787  -1.801 0.071766 .  
## fullnameNico Rosberg         -2.43295    0.65492  -3.715 0.000205 ***
## fullnameOther                -1.04727    0.49630  -2.110 0.034871 *  
## fullnameRalf Schumacher      -2.27859    0.68996  -3.302 0.000962 ***
## fullnameRomain Grosjean      -0.69624    0.70355  -0.990 0.322388    
## fullnameRubens Barrichello   -1.97576    0.63877  -3.093 0.001987 ** 
## fullnameSebastian Vettel     -2.94648    0.58946  -4.999 5.88e-07 ***
## fullnameSergio Pérez         -2.77168    0.56127  -4.938 8.02e-07 ***
## fullnameValtteri Bottas      -3.38177    0.64398  -5.251 1.54e-07 ***
## era90 degrees V8             -0.19567    0.28781  -0.680 0.496616    
## eraV10                       -0.45452    0.35751  -1.271 0.203627    
## grid:era90 degrees V8         0.03029    0.02157   1.404 0.160292    
## grid:eraV10                  -0.08097    0.02809  -2.882 0.003959 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.939 on 9104 degrees of freedom
## Multiple R-squared:  0.367,  Adjusted R-squared:  0.3632 
## F-statistic: 95.97 on 55 and 9104 DF,  p-value: < 2.2e-16


7 - Model 3 (Liner Regression)


  • Taking out the data from the 12 cylinders era further improves the R-squared values to 0.39
  • Surprisingly starting position doesn’t appear to be statistically significant which doesn’t make sense. One would expect there to be a strong correlation between starting position and finishing position.
  • This model appears to be the best and provides the highest R-squared of all the models, but not the least amount of error.
F1DataFrame = F1DataFrame %>% filter(F1DataFrame$year >= "2000")
mymodel = lm(positionOrder ~ grid + name + fullname + era + grid:era, data = F1DataFrame)
simpleModel = lm(positionOrder ~ grid, data = F1DataFrame)
par(mfrow = c(1,1))
plot(mymodel)


Model 3 Summary Statistics

summary(mymodel)
## 
## Call:
## lm(formula = positionOrder ~ grid + name + fullname + era + grid:era, 
##     data = F1DataFrame)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.305  -3.527  -1.062   2.713  19.492 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  10.18992    0.68222  14.936  < 2e-16 ***
## grid                          0.35685    0.01675  21.306  < 2e-16 ***
## nameAlphaTauri               -0.96330    0.60783  -1.585 0.113043    
## nameAlpine F1 Team           -2.01214    0.68050  -2.957 0.003116 ** 
## nameArrows                    1.68047    0.70947   2.369 0.017875 *  
## nameBAR                      -0.18805    0.58039  -0.324 0.745948    
## nameBMW Sauber               -3.24381    0.63884  -5.078 3.90e-07 ***
## nameCaterham                  1.19126    0.63429   1.878 0.060402 .  
## nameFerrari                  -3.45016    0.45662  -7.556 4.56e-14 ***
## nameForce India              -1.61940    0.51491  -3.145 0.001666 ** 
## nameHaas F1 Team              0.37112    0.56909   0.652 0.514330    
## nameHonda                    -0.41850    0.69192  -0.605 0.545302    
## nameHRT                       2.58153    0.65514   3.940 8.20e-05 ***
## nameJaguar                    0.77193    0.60813   1.269 0.204344    
## nameJordan                    1.19768    0.58538   2.046 0.040788 *  
## nameLotus F1                 -1.64706    0.61482  -2.679 0.007399 ** 
## nameMarussia                  1.13948    0.64156   1.776 0.075747 .  
## nameMcLaren                  -1.58150    0.45005  -3.514 0.000443 ***
## nameMercedes                 -2.96328    0.52171  -5.680 1.39e-08 ***
## nameMinardi                   1.17922    0.62166   1.897 0.057874 .  
## nameOther                     0.32732    0.46878   0.698 0.485051    
## nameRed Bull                 -2.67785    0.49441  -5.416 6.24e-08 ***
## nameRenault                  -1.48690    0.47883  -3.105 0.001907 ** 
## nameSauber                   -0.28114    0.47039  -0.598 0.550072    
## nameToro Rosso               -0.30402    0.47383  -0.642 0.521137    
## nameToyota                   -0.71284    0.54921  -1.298 0.194339    
## nameWilliams                 -0.19097    0.44755  -0.427 0.669601    
## fullnameCarlos Sainz         -1.92574    0.64850  -2.970 0.002990 ** 
## fullnameDaniel Ricciardo     -2.15690    0.61099  -3.530 0.000417 ***
## fullnameDavid Coulthard      -1.13276    0.66200  -1.711 0.087092 .  
## fullnameFelipe Massa         -2.00781    0.59393  -3.381 0.000726 ***
## fullnameFernando Alonso      -2.56675    0.57460  -4.467 8.03e-06 ***
## fullnameGiancarlo Fisichella -1.91949    0.60986  -3.147 0.001653 ** 
## fullnameJarno Trulli         -0.71872    0.62250  -1.155 0.248299    
## fullnameJenson Button        -2.81085    0.58919  -4.771 1.86e-06 ***
## fullnameKevin Magnussen      -1.31376    0.71091  -1.848 0.064634 .  
## fullnameKimi Räikkönen       -2.05271    0.58146  -3.530 0.000417 ***
## fullnameLance Stroll         -2.88345    0.68776  -4.192 2.79e-05 ***
## fullnameLewis Hamilton       -4.29182    0.62320  -6.887 6.09e-12 ***
## fullnameMark Webber          -1.49148    0.62177  -2.399 0.016470 *  
## fullnameMax Verstappen       -3.24201    0.67972  -4.770 1.87e-06 ***
## fullnameMichael Schumacher   -1.76263    0.66470  -2.652 0.008021 ** 
## fullnameNick Heidfeld        -1.50224    0.64358  -2.334 0.019606 *  
## fullnameNico Hülkenberg      -1.05863    0.58787  -1.801 0.071766 .  
## fullnameNico Rosberg         -2.43295    0.65492  -3.715 0.000205 ***
## fullnameOther                -1.04727    0.49630  -2.110 0.034871 *  
## fullnameRalf Schumacher      -2.27859    0.68996  -3.302 0.000962 ***
## fullnameRomain Grosjean      -0.69624    0.70355  -0.990 0.322388    
## fullnameRubens Barrichello   -1.97576    0.63877  -3.093 0.001987 ** 
## fullnameSebastian Vettel     -2.94648    0.58946  -4.999 5.88e-07 ***
## fullnameSergio Pérez         -2.77168    0.56127  -4.938 8.02e-07 ***
## fullnameValtteri Bottas      -3.38177    0.64398  -5.251 1.54e-07 ***
## era90 degrees V8             -0.19567    0.28781  -0.680 0.496616    
## eraV10                       -0.45452    0.35751  -1.271 0.203627    
## grid:era90 degrees V8         0.03029    0.02157   1.404 0.160292    
## grid:eraV10                  -0.08097    0.02809  -2.882 0.003959 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.939 on 9104 degrees of freedom
## Multiple R-squared:  0.367,  Adjusted R-squared:  0.3632 
## F-statistic: 95.97 on 55 and 9104 DF,  p-value: < 2.2e-16
summary(simpleModel)
## 
## Call:
## lm(formula = positionOrder ~ grid, data = F1DataFrame)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.289  -3.824  -1.086   3.015  18.930 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 5.070542   0.109514   46.30   <2e-16 ***
## grid        0.550774   0.008669   63.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.157 on 9158 degrees of freedom
## Multiple R-squared:  0.3059, Adjusted R-squared:  0.3058 
## F-statistic:  4036 on 1 and 9158 DF,  p-value: < 2.2e-16


8 Plots for Each Era


  • Seems to show a weak but significant correlation between better starting positions with better finishing positions.
  • All plots appear to have comparable distributions.
  • V10plot = F1DataFrame %>% filter(F1DataFrame$era == "V10") %>% ggplot(aes(x = grid, y = positionOrder, colour = era)) + 
      geom_point(position = "jitter", size = .8) +
      labs(title="starting vs finishing position by era", 
         y="finishing position", 
         x="starting position")
    
    V8plot = F1DataFrame %>% filter(F1DataFrame$era == "90 degrees V8") %>% ggplot(aes(x = grid, y = positionOrder, colour = era)) + 
      geom_point(position = "jitter", size = .8) +
      labs(title="starting vs finishing position by era", 
         y="finishing position", 
         x="starting position")
    
    V8KERSplot = F1DataFrame %>% filter(F1DataFrame$era == "90 degrees V8 + KERS") %>% ggplot(aes(x = grid, y = positionOrder, colour = era)) + 
      geom_point(position = "jitter", size = .8) +
      labs(title="starting vs finishing position by era", 
         y="finishing position", 
         x="starting position")
    
    V8MGUsplot = F1DataFrame %>% filter(F1DataFrame$era == "90 degrees V6 + MGUs") %>% ggplot(aes(x = grid, y = positionOrder, colour = era)) + 
      geom_point(position = "jitter", size = .8) +
      labs(title="starting vs finishing position by era", 
         y="finishing position", 
         x="starting position")
    
    V10plot

    V8plot

    V8KERSplot

    V8MGUsplot


    9 - Model 4 (KNN)


    • This model also performed well relative to the first two. It also provides the least amount of error, so we chose this as the best model - regardless of Model 3 having a higher R-squared.
    set.seed(1234)
    fitControl<-trainControl(method="repeatedcv",number=10,repeats=1)
    knn.fit<-train(positionOrder~.,
                   data=modelDataframe,
                   method="knn",
                   trControl=fitControl,
                   tuneGrid=expand.grid(k=c(1:10,20,30))
                   )
    knn.fit
    ## k-Nearest Neighbors 
    ## 
    ## 9160 samples
    ##    4 predictor
    ## 
    ## No pre-processing
    ## Resampling: Cross-Validated (10 fold, repeated 1 times) 
    ## Summary of sample sizes: 8244, 8244, 8244, 8244, 8243, 8244, ... 
    ## Resampling results across tuning parameters:
    ## 
    ##   k   RMSE      Rsquared   MAE     
    ##    1  5.532697  0.2637072  4.210911
    ##    2  5.300432  0.2979288  4.086467
    ##    3  5.198577  0.3150419  4.022011
    ##    4  5.146384  0.3242655  3.991361
    ##    5  5.098006  0.3338238  3.959075
    ##    6  5.060262  0.3413419  3.934518
    ##    7  5.042913  0.3446800  3.925552
    ##    8  5.032698  0.3466881  3.916071
    ##    9  5.021445  0.3489045  3.911490
    ##   10  5.008533  0.3515009  3.906285
    ##   20  4.981941  0.3562147  3.897474
    ##   30  4.969751  0.3587330  3.898560
    ## 
    ## RMSE was used to select the optimal model using the smallest value.
    ## The final value used for the model was k = 30.


    # THANK YOU!